Abstract

Shor's quantum algorithm for discrete logarithms, applied to elliptic
curve groups, forms the basis of a "quantum attack" on elliptic curve
cryptosystems. Implementing this algorithm on a quantum computer
requires an efficient implementation of the elliptic curve group
operation, which in turn requires computing inverses in the underlying
field. In [PZ03], Proos and Zalka show how to implement the extended
Euclidean algorithm to compute inverses in the prime field GF(p). They
employ a number of optimizations to achieve a running time of O(n^2)
and a space requirement of O(n) qubits (making some trade-offs that
sacrifice a few extra qubits to reduce the running time). In practice,
elliptic curve cryptosystems often use curves over the binary field
GF(2^m). In this paper, we show how to implement the extended Euclidean
algorithm for polynomials to compute inverses in GF(2^m). Working under
the assumption that qubits will be an 'expensive' resource in realistic
implementations, we optimize specifically to reduce the qubit space
requirement, while keeping the running time polynomial. Our
implementation differs from that in [PZ03] for GF(p), and we are able to
take advantage of some properties of the binary field GF(2^m). We also
optimize the overall qubit space requirement for computing the group
operation for elliptic curves over GF(2^m) by decomposing the group
operation to make it "piecewise reversible" (similar to what is done in
[PZ03] for curves over GF(p)).
1 Introduction



A very significant potential application of quantum computers lies in their 
ability to efficiently solve the problem of finding discrete logarithms over 
finite groups. It is this ability that makes quantum computers capable, in 
principle, of undermining the security of elliptic curve cryptographic sys- 
tems, which are widely used by industry and government to protect sen- 
sitive information. There is no known classical algorithm for solving the 
discrete logarithm problem in polynomial time. In 1994, Peter Shor [Sho94]
described a quantum algorithm for solving this problem in polynomial time. 

The construction of medium- or large-scale quantum computers has 
turned out to be an enormous technological challenge. For most of the 
proposed (practical) schemes for implementing quantum computers, qubits 
are a very 'expensive' resource. Thus there is a significant practical in- 
terest in optimizing quantum algorithms to use as few qubits as possible. 
In [PZ03], Proos and Zalka give an optimized implementation of the discrete
logarithm algorithm for the particular case of elliptic curve groups.
They consider only elliptic curves over the prime fields GF(p). Many
elliptic curve cryptosystems, however, use elliptic curves over the binary
fields GF(2^m), so it is important to examine the number of qubits required to
implement the discrete logarithm algorithm for elliptic curve groups over
these binary fields. In this direction, we show how to decompose the group
operation into a series of smaller, individually reversible steps (following
the approach taken in [PZ03]). Some of these steps will involve divisions of
elements in the binary field GF(2^m). To solve this problem, we show how to
implement the extended Euclidean algorithm for polynomials, and optimize 
this implementation to use few qubits. 

2 Elliptic curves over GF(2^m)

An elliptic curve over a field F is the set of points $(x, y) \in F^2$ satisfying

$$y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_5,$$

subject to some additional conditions on the constants $a_1, \ldots, a_5 \in F$,
together with a 'point at infinity', denoted $\mathcal{O}$. For the particular case of curves
over the finite fields GF(2^m), the defining equation and additional conditions
simplify as follows.

Case 1: $a_1 \neq 0$ (non-supersingular curves)

$$y^2 + xy = x^3 + ax^2 + b, \qquad b \neq 0.$$

Case 2: $a_1 = 0$ (supersingular curves)

$$y^2 + cy = x^3 + ax + b, \qquad c \neq 0.$$

An elliptic curve over GF(2^m) is the set of points $(x, y) \in \mathrm{GF}(2^m) \times \mathrm{GF}(2^m)$
that satisfy one of the above two equations, together with the point
at infinity $\mathcal{O}$. A particular curve of one of the above types is specified by
giving values to the constants a, b (and c in the case of a supersingular
curve). The set of points on a given elliptic curve forms a group under the
following operation of addition. Let $P = (x_1, y_1)$ and $R = (x_2, y_2)$, where
$P \neq R$, be two distinct points on a curve over GF(2^m). The point P + R is
defined as follows.

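Case 1: non-supersingular curves

$$P + R = \begin{cases} \mathcal{O} & \text{if } (x_2, y_2) = (x_1, x_1 + y_1), \\ (x_3, y_3) & \text{otherwise,} \end{cases}$$

where

$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \qquad y_3 = \lambda(x_1 + x_3) + x_3 + y_1, \qquad \lambda = \frac{y_1 + y_2}{x_1 + x_2}.$$

Case 2: supersingular curves

$$P + R = \begin{cases} \mathcal{O} & \text{if } (x_2, y_2) = (x_1, y_1 + c), \\ (x_3, y_3) & \text{otherwise,} \end{cases}$$

where

$$x_3 = \lambda^2 + x_1 + x_2, \qquad y_3 = \lambda(x_1 + x_3) + y_1 + c, \qquad \lambda = \frac{y_1 + y_2}{x_1 + x_2}.$$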

Following the argument in [PZ03], we can avoid dealing with the cases P = R
(point doubling), P = -R, and R = $\mathcal{O}$, and restrict ourselves to the generic
group addition formulae in terms of $x_3, y_3$ above. The key observation is
that in a superposition (such as we would have in the quantum discrete 
logarithm algorithm), situations other than the generic case will occur for 
only a small fraction of the elements in superposition, and so by ignoring 
them the fidelity loss will be negligible. 
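
As a concrete check of the generic formula for non-supersingular curves, the following minimal sketch adds points over a toy field. The field GF(2^4) with reduction polynomial z^4 + z + 1, the curve constants a = b = 1, and all function names are illustrative assumptions, not choices made in this paper. Field elements are stored as integers whose bit i is the coefficient of z^i.

```python
# Toy field GF(2^4) with f(z) = z^4 + z + 1 (an assumed, illustrative choice).
M = 4
F_POLY = 0b10011

def gf_mul(u, v):
    """Multiply two field elements, reducing modulo f as we go."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        v >>= 1
        u <<= 1
        if u >> M:          # degree reached m: subtract (XOR) f
            u ^= F_POLY
    return r

def gf_inv(u):
    """Inverse of a nonzero element, computed as u^(2^m - 2)."""
    r = 1
    for _ in range(2**M - 2):
        r = gf_mul(r, u)
    return r

def on_curve(x, y, a, b):
    """Check y^2 + xy = x^3 + a x^2 + b."""
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    x2 = gf_mul(x, x)
    rhs = gf_mul(x2, x) ^ gf_mul(a, x2) ^ b
    return lhs == rhs

def add_generic(P, R, a):
    """Generic addition P + R; only valid when x1 != x2."""
    (x1, y1), (x2, y2) = P, R
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return x3, y3

A_COEF, B_COEF = 1, 1   # hypothetical curve constants with b != 0
points = [(x, y) for x in range(2**M) for y in range(2**M)
          if on_curve(x, y, A_COEF, B_COEF)]
generic_pairs = [(P, R) for P in points for R in points if P[0] != R[0]]
# Every generic sum should land back on the curve.
all_closed = all(on_curve(*add_generic(P, R, A_COEF), A_COEF, B_COEF)
                 for P, R in generic_pairs)
```

The sketch deliberately handles only the generic case (distinct x-coordinates), mirroring the restriction discussed above; doubling and the point at infinity are left out.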
3 The discrete logarithm algorithm for elliptic curve
groups 

Let G be a cyclic group, and let $\alpha$ be a generator for G. The discrete
logarithm problem with respect to the base $\alpha$ is the following. Given a
group element $\beta \in G$, find the unique integer $d \in [0, |G| - 1]$ such that
$\beta = \alpha^d$. Recall that Shor's quantum algorithm for solving the discrete
logarithm problem makes use of a unitary operator that performs

$$|x\rangle|y\rangle|z\rangle \to |x\rangle|y\rangle|z \oplus \alpha^x \beta^y\rangle, \qquad (*)$$

where x and y are integers in the range $[0, \ldots, |G| - 1]$.

Consider an elliptic curve E and let P be a point on E. Consider the 
cyclic subgroup of the elliptic curve group generated by P. We are interested 
in solving the discrete logarithm problem for this subgroup. The group 
operation is written additively, so the discrete logarithm problem is the 
following. Given a point Q in the subgroup generated by P, find the unique 
integer $d \in [0, \ldots, \mathrm{order}(P) - 1]$ such that Q = dP. The unitary operation
(*) used in Shor's algorithm performs

$$|x\rangle|y\rangle|z\rangle \to |x\rangle|y\rangle|z \oplus (xP + yQ)\rangle.$$

Employing the semiclassical Fourier transform of Griffiths and Niu [GN95]
as detailed in [PZ03], for the discrete logarithm algorithm it suffices to be
able to implement

$$|S\rangle \to |S + A\rangle, \qquad S, A \in E,$$

where A is fixed and 'classically known'. Writing $S = (x, y)$ and $A = (\alpha, \beta)$,
we want to implement

$$|(x, y)\rangle \to |(x, y) + (\alpha, \beta)\rangle.$$

4 Decomposing the group operation 

We now show how to decompose the group operation for curves over GF(2^m)
into a sequence of individually reversible steps. Doing so allows the
group operation to be implemented with a smaller number of ancillary qubits.

We will use the following notation. When we write $x \to y$, we are
referring to a (not necessarily reversible) computation transforming the value
x into the value y. When we write $x \leftrightarrow y$, we are referring to a reversible
computation which can be seen as transforming x into y, or as transforming
y into x.

For a fixed point $(\alpha, \beta)$, define $(x', y') := (x, y) + (\alpha, \beta)$. We want to
decompose the operation

$$|(x, y)\rangle \to |(x', y')\rangle.$$

For simplicity, in the following we will write the values without the Dirac
ket symbols.

Case 1: non-supersingular curves
We have

$$\lambda = \frac{y + \beta}{x + \alpha} = \frac{x' + y' + \beta}{x' + \alpha}.$$

The second equality follows by substituting $y' = \lambda(x + x') + x' + y$ and
using $\lambda(x + \alpha) = y + \beta$. The group operation is decomposed as

$$x, y \leftrightarrow x + \alpha,\ y + \beta \leftrightarrow x + \alpha,\ \lambda = \frac{y + \beta}{x + \alpha} \leftrightarrow x' + \alpha,\ \lambda = \frac{x' + y' + \beta}{x' + \alpha} \leftrightarrow x' + \alpha,\ x' + y' + \beta \leftrightarrow x',\ x' + y' \leftrightarrow x',\ y'.$$

The second step in the above decomposition is a division, and the
fourth step is a multiplication, where in each case one of the operands is
uncomputed in the process. All the other steps involve only additions
(and the third step also requires the squaring of $\lambda$). It turns out
that the number of qubits required to perform the group operation
is bounded by the number of qubits required to perform a division or
multiplication where one of the operands is uncomputed in the process.
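
The step sequence for the non-supersingular case can be traced classically. The sketch below is only an illustration under assumptions: it uses a toy field GF(2^4) with f(z) = z^4 + z + 1, hypothetical helper names, and ordinary (irreversible) Python arithmetic in place of reversible circuits. It walks two registers s, t through the steps, using the fact that the multiplication in the fourth step yields $\lambda(x' + \alpha) = x' + y' + \beta$, and compares the outcome with the direct addition formula.

```python
M, F_POLY = 4, 0b10011     # toy field GF(2^4), f(z) = z^4 + z + 1 (assumed)

def mul(u, v):
    """Field multiplication with reduction modulo f."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        v >>= 1
        u <<= 1
        if u >> M:
            u ^= F_POLY
    return r

def inv(u):
    """Inverse of a nonzero element as u^(2^m - 2)."""
    r = 1
    for _ in range(2**M - 2):
        r = mul(r, u)
    return r

def div(u, v):
    return mul(u, inv(v))

def add_direct(x, y, alpha, beta, a):
    """Generic addition formula, for comparison."""
    lam = div(y ^ beta, x ^ alpha)
    x3 = mul(lam, lam) ^ lam ^ x ^ alpha ^ a
    y3 = mul(lam, x ^ x3) ^ x3 ^ y
    return x3, y3

def add_by_steps(x, y, alpha, beta, a):
    """Walk (x, y) to (x', y') through the decomposition; requires x != alpha."""
    s, t = x ^ alpha, y ^ beta          # step 1: add the fixed point's coordinates
    t = div(t, s)                       # step 2: division, t now holds lambda
    s ^= mul(t, t) ^ t ^ a ^ alpha      # step 3: s becomes x' + alpha
    t = mul(t, s)                       # step 4: multiplication, t = x' + y' + beta
    s, t = s ^ alpha, t ^ beta          # step 5: s = x', t = x' + y'
    t ^= s                              # step 6: t = y'
    return s, t

ALPHA, BETA, A_COEF = 3, 5, 1           # arbitrary illustrative constants
steps_match = all(
    add_by_steps(x, y, ALPHA, BETA, A_COEF) == add_direct(x, y, ALPHA, BETA, A_COEF)
    for x in range(2**M) for y in range(2**M) if x != ALPHA
)
```

Because the identity behind step 4 is purely algebraic, the comparison can run over all field elements with x different from alpha, not only over points of a particular curve.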

Case 2: supersingular curves
We have

$$\lambda = \frac{y + \beta}{x + \alpha} = \frac{y' + c + \beta}{x' + \alpha}.$$

The group operation is decomposed as

$$x, y \leftrightarrow x + \alpha,\ y + \beta \leftrightarrow x + \alpha,\ \lambda = \frac{y + \beta}{x + \alpha} \leftrightarrow x' + \alpha,\ \lambda = \frac{y' + c + \beta}{x' + \alpha} \leftrightarrow x' + \alpha,\ y' + c + \beta \leftrightarrow x',\ y'.$$

As in the non-supersingular case, the second step in the above
decomposition is a division, and the fourth step is a multiplication, where
in each case one of the operands is uncomputed in the process. The
other steps involve only additions, and one squaring. So again the
qubit-space requirement for the group operation is that for a division
or multiplication where one of the operands is uncomputed in the
process.

In both the supersingular and non-supersingular case, the qubit space
requirement of the group operation is determined by that of performing a
division or multiplication, where one of the operands is uncomputed in the
process. Such a multiplication can be achieved by running such a division
backwards, so we turn our attention to implementing divisions of the form
$x, y \leftrightarrow x, y/x$, using as few qubits as possible. Following [PZ03], the
division is decomposed into the following four reversible steps.

$$x, y \overset{E}{\leftrightarrow} 1/x,\ y \overset{m}{\leftrightarrow} 1/x,\ y,\ y/x \overset{E}{\leftrightarrow} x,\ y,\ y/x \overset{m}{\leftrightarrow} x,\ 0,\ y/x.$$

The letters over the arrows are m for standard polynomial multiplication,
and E for "Euclid's algorithm". The second m is really a standard
polynomial multiplication run backwards to uncompute y. We know how to
implement standard multiplication in GF(2^m) using 2m qubits by [BBF03],
so it remains to show how to implement the extended Euclidean algorithm
for polynomials to compute inverses in GF(2^m).

5 The extended Euclidean algorithm for polynomials

Suppose A(z) and B(z) are two binary polynomials in the variable z, of
degrees less than m (i.e. $A, B \in \mathrm{GF}(2^m)$). Suppose A and B are not both
0, and are such that deg(A) < deg(B). The greatest common divisor of
A and B, denoted gcd(A, B), is the binary polynomial of highest degree
that divides both A and B. The classical Euclidean algorithm for finding
gcd(A, B) is based on the fact that gcd(A, B) = gcd(B - CA, A), for all
binary polynomials C. If we divide B by A (by standard long division of
polynomials), obtaining a quotient polynomial q(z) and a remainder
polynomial r(z) satisfying B = qA + r, then deg(r) < deg(A). By the fact observed
above, we have gcd(A, B) = gcd(r, A). The classical Euclidean algorithm for
polynomials makes this replacement repeatedly until one of the arguments
is 0. If we set $r_0 = A$ and $r_1 = B$, the Euclidean algorithm performs the
following sequence of divisions:

$$\begin{aligned}
r_0 &= q_1 r_1 + r_2, & 0 &\le \deg(r_2) < \deg(r_1) \\
r_1 &= q_2 r_2 + r_3, & 0 &\le \deg(r_3) < \deg(r_2) \\
&\ \,\vdots \\
r_{m-2} &= q_{m-1} r_{m-1} + r_m, & 0 &\le \deg(r_m) < \deg(r_{m-1}) \\
r_{m-1} &= q_m r_m.
\end{aligned}$$

The fact above gives us the corresponding sequence of equalities:

$$\gcd(r_0, r_1) = \gcd(r_1, r_2) = \ldots = \gcd(r_{m-1}, r_m) = \gcd(r_m, 0).$$

At this point we have the result, since $\gcd(r_m, 0) = r_m$. The algorithm is
guaranteed to terminate, since the degree of one of the arguments strictly
decreases in each step. Moreover, the algorithm is efficient because the
number of iterations is bounded by the degree of A (which is at most m).

Recall that the gcd of two integers a, b can always be written as a linear
combination of a and b having integral coefficients. The same is true for
the gcd of two polynomials A, B. That is, there exist polynomials k, k' in
GF(2^m) such that

$$\gcd(A, B) = kA + k'B.$$

The extended Euclidean algorithm for polynomials is the same as the
Euclidean algorithm for polynomials except that it also keeps track of the
'coefficient' polynomials k, k' above. It does so through the following
recurrences.

$$k_j = \begin{cases} 1 & \text{if } j = 0 \\ 0 & \text{if } j = 1 \\ k_{j-2} - q_{j-1} k_{j-1} & \text{if } j \ge 2 \end{cases}$$

and

$$k'_j = \begin{cases} 0 & \text{if } j = 0 \\ 1 & \text{if } j = 1 \\ k'_{j-2} - q_{j-1} k'_{j-1} & \text{if } j \ge 2. \end{cases}$$

It is not hard to show that for $0 \le j \le m$ we have $r_j = k_j r_0 + k'_j r_1$, where
the $r_j$'s are defined as in the Euclidean algorithm for polynomials, and the
$k_j$ and the $k'_j$ are defined by the above recurrences.

For reference, we write the extended Euclidean algorithm for polynomials
in pseudo-code below. The notation x ← y is intended to mean that we
assign the value of y to the variable named x.

EXTENDED EUCLIDEAN ALGORITHM FOR POLYNOMIALS

    A_0 ← A
    B_0 ← B
    k_0 ← 1
    k ← 0
    k'_0 ← 0
    k' ← 1
    q ← ⌊A_0 / B_0⌋
    r ← A_0 - q B_0
    while r > 0 do
        temp ← k'_0 - q k'
        k'_0 ← k'
        k' ← temp
        temp ← k_0 - q k
        k_0 ← k
        k ← temp
        A_0 ← B_0
        B_0 ← r
        q ← ⌊A_0 / B_0⌋
        r ← A_0 - q B_0
    r ← B_0
    return(r, k, k')

Inverses in GF(2^m) can be computed using the extended Euclidean
algorithm for polynomials, as follows. Suppose f(z) is an irreducible polynomial
of degree m, and let C(z) be a nonzero binary polynomial of degree at most m - 1. Then
gcd(C, f) = 1, and the extended Euclidean algorithm for polynomials finds
binary polynomials k and k' such that kC + k'f = 1. But this means that
$kC \equiv 1 \pmod{f}$, and so $k = C^{-1} \pmod{f}$. The coefficient k' of f is not
needed for the inversion of C, and so we only need to record the coefficient
k of C throughout the algorithm.
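
The inversion procedure can be sketched classically as follows. This is a minimal illustration under assumptions, not the quantum implementation developed later: binary polynomials are stored as Python integers (bit i is the coefficient of z^i), the helper names are hypothetical, and the example modulus f(z) = z^4 + z + 1 is chosen only for the demonstration. Only the coefficient of C is tracked, as described above.

```python
def deg(p):
    """Degree of a binary polynomial stored as an int (deg(0) = -1)."""
    return p.bit_length() - 1

def poly_divmod(n, d):
    """Long division of binary polynomials: (q, r) with n = q*d + r, d nonzero."""
    q = 0
    while deg(n) >= deg(d):
        shift = deg(n) - deg(d)
        q ^= 1 << shift
        n ^= d << shift          # subtraction is XOR over GF(2)
    return q, n

def poly_mul(u, v):
    """Carry-less product of two binary polynomials (no modular reduction)."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        u <<= 1
        v >>= 1
    return r

def inverse_mod(c, f):
    """Return k with k*c = 1 (mod f); f irreducible, c nonzero with deg(c) < deg(f)."""
    r0, r1 = f, c                # successive remainders
    k0, k1 = 0, 1                # coefficients of c for r0 and r1 (mod f)
    while r1 != 0:
        q, r = poly_divmod(r0, r1)
        r0, r1 = r1, r
        k0, k1 = k1, k0 ^ poly_mul(q, k1)
    if r0 != 1:
        raise ValueError("gcd(c, f) is not 1")
    return k0

F = 0b10011                      # z^4 + z + 1, an assumed example modulus
inverses = {c: inverse_mod(c, F) for c in range(1, 16)}
```

For example, `poly_divmod(poly_mul(inverses[c], c), F)[1]` evaluates to 1 for every nonzero c of degree below 4.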

6 Naive Implementation of the extended Euclidean 
algorithm for polynomials 

We now turn our attention to quantum implementations of the extended
Euclidean algorithm for polynomials for computing the inverse of an element
C. Following [PZ03], our implementations will maintain two ordered pairs
(a, A) and (b, B), where A and B record the sequence of remainders in
the Euclidean algorithm for polynomials, and a and b record the updated
coefficient of C for each of the past two iterations of the algorithm. We
call these ordered pairs Euclidean pairs. The algorithm begins with (a, A) =
(1, C), and (b, B) = (0, f) (where f is an irreducible polynomial of degree
m). Note that deg(C) ≤ m - 1 < m = deg(f). We will always store the
Euclidean pair with the smaller-degree polynomial in the second co-ordinate
first. That is, we store the Euclidean pairs in the order

$$(a, A),\ (b, B)$$

where deg(A) < deg(B). We then want to perform long division of B
by A, obtaining a quotient polynomial q and a remainder polynomial r
satisfying B = qA - r = qA + r (the second equality follows since the field
is binary), where q is the quotient polynomial of B/A, which we denote as
$q = \lfloor B/A \rfloor$. We will then replace B by r = B + qA, and b by b + qa. Since
deg(r) < deg(A), after the above replacement we will have to interchange the
Euclidean pairs to maintain the ordering so that the pair with the
smaller-degree polynomial in the second co-ordinate appears first. So one iteration
of the algorithm can be written as

$$(a, A),\ (b, B) \to (b + qa, B + qA),\ (a, A),\ q \qquad \text{where } q = \lfloor B/A \rfloor.$$

At the beginning of the Euclidean algorithm, we start with a = 1, b = 0, A =
C, B = f, and so deg(A) < deg(B) and deg(a) > deg(b). It is easy to see
that this condition is preserved in every iteration of the algorithm. This
implies that

$$q = \left\lfloor \frac{b + qa}{a} \right\rfloor.$$

So while q is computed from the second co-ordinates of the Euclidean pairs
(a, A), (b, B), it can be uncomputed from the first co-ordinates of the modified
Euclidean pairs (b + qa, B + qA), (a, A). Thus each iteration of the Euclidean
algorithm is individually reversible, and can be written as

$$(a, A),\ (b, B) \leftrightarrow (b + qa, B + qA),\ (a, A) \qquad \text{where } q = \lfloor B/A \rfloor.$$

This is decomposed into the following three individually reversible steps:

$$A,\ B,\ 0 \leftrightarrow A,\ B + qA,\ q$$

$$a,\ b,\ q \leftrightarrow a,\ b + qa,\ 0$$

SWAP

where "SWAP" refers to the operation of switching the two Euclidean pairs.
Since deg(b) < deg(a) < deg(b + qa), the second operation above is simply the
reverse of the first operation, applied to a and b + qa.
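
The reversibility of a single iteration can be checked classically. In the sketch below (an illustration under assumptions, with hypothetical function names and binary polynomials stored as integers), `forward` computes q from the second co-ordinates, while `backward` recovers the same q from the first co-ordinates alone by dividing b + qa by a, and then restores the original Euclidean pairs.

```python
def deg(p):
    """Degree of a binary polynomial stored as an int (deg(0) = -1)."""
    return p.bit_length() - 1

def poly_divmod(n, d):
    """Long division over GF(2): (q, r) with n = q*d + r, d nonzero."""
    q = 0
    while deg(n) >= deg(d):
        shift = deg(n) - deg(d)
        q ^= 1 << shift
        n ^= d << shift
    return q, n

def poly_mul(u, v):
    """Carry-less product of two binary polynomials."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        u <<= 1
        v >>= 1
    return r

def forward(pairs):
    """(a, A), (b, B) -> (b + qa, B + qA), (a, A) with q = floor(B / A)."""
    (a, A), (b, B) = pairs
    q, r = poly_divmod(B, A)             # q read off the second co-ordinates
    return (b ^ poly_mul(q, a), r), (a, A)

def backward(pairs):
    """Undo one iteration, recovering q from the first co-ordinates only."""
    (a_new, r), (a, A) = pairs
    q, b = poly_divmod(a_new, a)         # quotient q, remainder b
    return (a, A), (b, r ^ poly_mul(q, A))

def check_reversible(C, f):
    """Run the algorithm on C modulo f, undoing and redoing every iteration."""
    state = ((1, C), (0, f))
    while state[0][1] != 0:
        nxt = forward(state)
        if backward(nxt) != state:
            return False
        state = nxt
    return True

F = 0b10011                              # z^4 + z + 1, an assumed example modulus
all_reversible = all(check_reversible(C, F) for C in range(1, 16))
```

The check relies on deg(b) < deg(a) holding at every stage, which is what makes b the remainder when b + qa is divided by a.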

To perform the division $A, B, 0 \leftrightarrow A, B + qA, q$ we can use long division
of the binary polynomial B by A. To implement this long division, the
basic idea is to shift A all the way to the left (i.e. we shift A left by
m - deg(A) - 1 bits). Then we start shifting A to the right one bit at
a time, each time conditionally doing a subtraction. For the binary field
GF(2^m) this is simplified by virtue of the fact that subtraction is the same
as addition, and is achieved by a bitwise XOR operation. This bitwise XOR
can be implemented quantumly using CNOT gates, and no ancillary qubits.
(Furthermore, these CNOTs could in principle be performed in parallel,
allowing us to do addition in a single step.) Note that in our long divisions
we are doing more work than necessary. Often the degree of B will be less
than m - 1, and so it would not be necessary to shift A all the way to the
left (we could just shift it so the most significant bits of A and B line up).
For simplicity, in the naive implementation we do not take advantage of this
fact, but will do so when we look at an optimized implementation.

6.1 Implementing some tools 

To implement the long division, there are some subcomponents that we 
will need to implement. We describe implementations of some of these 
subcomponents here, optimizing for the number of qubits. 

In what follows, we will show how to implement some operation, and
then use that operation controlled on the value(s) of some other qubit(s).
We need to consider whether this can be done without the requirement for
any additional qubits, or an unreasonable increase in the running time.
Fortunately, by [BBC+95], given a gate performing U, we can construct a gate
performing a controlled-U (that is, U conditioned on a control qubit being in
state $|1\rangle$) with no additional ancillary qubits, and a small overhead in
running time. Using this result repeatedly, we can implement U conditioned on
any desired pattern of control qubits (e.g. U may be applied only when a
three-qubit control register is in the state $|101\rangle$) with no additional
ancillary qubits, and a small overhead in running time. We will use this result
implicitly in the following.

For the long division, we will need to compute the degree of A. The
circuit shown in Figure 1 accomplishes this. Each of the hollow circles
in the figure denotes a 0-control (that is, the (-1) operation is applied if
the control qubit is $|0\rangle$). To uncompute the degree, we can simply run the
circuit shown in Figure 1 backwards.

Figure 1: Circuit to compute the degree of $A \in \mathrm{GF}(2^m)$. The degree register
starts in $|m - 1\rangle$ and ends in $|\deg(A)\rangle$.

The circuit in Figure 1 uses a sequence
of m decrementing (-1) gates, each of which is controlled by the values
of some of the qubits of $|A\rangle$. These decrementing gates update the value
of deg(A), being computed into a ⌈log(m - 1)⌉-qubit register. In Figure
2 we show how to implement an incrementing (+1) gate using only one
additional ancillary qubit. The ancillary qubit becomes the most significant
bit of the result. If we only apply the incrementing circuit to integers in
the range [0, ..., m - 2], we know that the ancillary qubit will always be
$|0\rangle$ at the output. Decrementing is accomplished by running this circuit
backwards, with the ancillary qubit initially set to $|0\rangle$. As long as we apply
the decrementing circuit to integers in the range [1, ..., m - 1], we know that
the ancillary qubit will always be $|1\rangle$ at the output. So we can reset the
ancillary qubit to $|0\rangle$ with a NOT gate after each decrement gate, and
re-use that ancillary qubit for the next decrement gate. Henceforth when we
count qubits in this paper, we will always assume ⌈log(m - 1)⌉ = ⌈log m⌉ =
⌈log(m + 1)⌉, and write ⌈log m⌉ for convenience. Similarly for ⌊log m⌋. So
the degree of $A \in \mathrm{GF}(2^m)$ can be computed using ⌈log m⌉ + 1 qubits (a
⌈log m⌉-qubit register into which the result is computed and stored, and 1
ancillary qubit shared by the decrementing gates).
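
A classical analogue of the degree computation may help fix ideas. The sketch below assumes a particular reading of the control pattern, namely that the gate for position j fires exactly when bits m - 1 down to j of A are all zero; that reading, and the function name, are assumptions made for illustration.

```python
def degree_by_decrements(A, m):
    """Start from m - 1 and decrement once per leading zero of the m-bit value A."""
    d = m - 1
    for j in range(m - 1, -1, -1):
        # 0-controlled gate: fires only if bits m-1, ..., j of A are all zero
        if (A >> j) == 0:
            d -= 1
    return d

M_BITS = 8                       # illustrative register width
degrees_ok = all(degree_by_decrements(A, M_BITS) == A.bit_length() - 1
                 for A in range(1, 2**M_BITS))
```

Under this reading the register passes only through values in the range [deg(A), m - 1] for nonzero A, which is consistent with the range restriction placed on the decrementing gate above.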

Figure 2: Circuit to compute $|k\rangle \leftrightarrow |k + 1\rangle$.

We also need to implement shifts of our quantum registers. For our
purpose it will suffice to implement a cyclic shift. We will make use of
the quantum SWAP gate, which swaps two qubits. A SWAP gate can be
implemented using 3 CNOT gates, and no ancillary qubits, as shown in
Figure 3.

Figure 3: The quantum SWAP gate.

A left cyclic shift gate which shifts the state of an n-qubit register to the
left cyclically by one qubit is implemented using n - 1 SWAP gates, and no
ancillary qubits, as shown in Figure 4.

A left shift of s qubits can be implemented by concatenating s single-qubit
left shifts together. Note that right shifts can be performed in an
analogous manner. We will also need to implement a shift conditioned on the
value contained in a quantum register. That is, a quantum implementation
of the operation

$$|\phi\rangle|s\rangle \leftrightarrow |\phi \ll s\rangle|s\rangle.$$

The controlled shift operation above is implemented by the circuit shown in
Figure 5, where k denotes the number of bits in the binary representation
of s.
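
The construction of the controlled shift from single-position shifts can be mimicked classically. The sketch below is an illustration under assumptions (registers are Python lists with index 0 as the leftmost position, and the function names are hypothetical): a cyclic left shift by one is built from n - 1 adjacent swaps, and a shift by s is obtained by applying the shift by 2^j whenever bit j of s is set.

```python
def rotl1(bits):
    """Cyclic left shift by one position, built from n - 1 adjacent swaps."""
    bits = list(bits)
    for i in range(len(bits) - 1):
        bits[i], bits[i + 1] = bits[i + 1], bits[i]
    return bits

def controlled_shift(bits, s, k):
    """Rotate left by s, where s is read bit by bit from a k-bit register."""
    for j in range(k):
        if (s >> j) & 1:                 # control on bit j of s
            for _ in range(2 ** j):      # shift by 2^j as 2^j single shifts
                bits = rotl1(bits)
    return bits

register = list("abcdefgh")
shifts_ok = all(controlled_shift(register, s, 3) == register[s:] + register[:s]
                for s in range(8))
```

The register holding s is only read, never modified, matching the form of the operation above.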
Figure 4: A cyclic left shift gate.

Figure 5: Circuit for $|\phi\rangle|s\rangle \leftrightarrow |\phi \ll s\rangle|s\rangle$. Here $k = \log_2 s$, and each gate $\ll 2^j$ is
implemented by a sequence of $2^j$ single-position ($\ll 1$) gates (shown previously). The overall time
complexity is polynomial in s, and no ancillary qubits are required.



6.2 Long division 

Now that we can compute the degrees of polynomials in GF(2^m), and
perform shifts of quantum registers, we can state an algorithm to reversibly
compute the long division

$$A,\ B,\ 0 \leftrightarrow A,\ B + qA,\ q$$

(note that the algorithm requires deg(A) < deg(B)).

Long Division

(0) Initialize q = 0.

(1) Compute deg(A).

(2) Compute i = m - deg(A) - 1.

(3) Shift A left by m - deg(A) - 1 positions.

(4) While i ≥ 0 do

    (4.1) If $B_{i+\deg(A)} = 1$, then set $q_i = 1$ and replace B with $B \oplus A$.

    (4.2) Shift A to the right one bit.

    (4.3) i ← i - 1.

(5) Uncompute deg(A).

At the end of the long division, the register originally containing B will
contain r = qA + B. Also, the auxiliary counter i will be zeroed, and
so can be re-used. The conditional setting of $q_i = 1$ in step (4.1) can be
accomplished by a CNOT gate, with $|B_{i+\deg(A)}\rangle$ as the control qubit and
$|q_i\rangle$ as the target qubit. Then, conditioned on $|q_i\rangle$, the operation $|A, B\rangle \leftrightarrow
|A, A \oplus B\rangle$ can be accomplished by CNOT gates between the corresponding
qubits of A and B. To conditionally apply this operation, we replace these
CNOT gates by Toffoli gates, with $|q_i\rangle$ as the additional control qubit.
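
The long division steps can be followed classically as in the sketch below. It is a minimal illustration under assumptions: polynomials are Python integers, `m` is treated as the register width in bits (so both A and B must fit in m bits), the shifted copy of A is kept in a separate variable rather than shifted in place, and the function names are hypothetical.

```python
def deg(p):
    """Degree of a binary polynomial stored as an int (deg(0) = -1)."""
    return p.bit_length() - 1

def poly_mul(u, v):
    """Carry-less product of two binary polynomials."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        u <<= 1
        v >>= 1
    return r

def long_division(A, B, m):
    """Return (A, B + qA, q); needs 0 <= deg(A) < deg(B) < m."""
    q = 0                        # (0) initialise the quotient
    d = deg(A)                   # (1) compute deg(A)
    i = m - d - 1                # (2)
    A_shifted = A << i           # (3) align A with the top of the register
    while i >= 0:                # (4)
        if (B >> (i + d)) & 1:   # (4.1) bit of B under the leading bit of A
            q |= 1 << i
            B ^= A_shifted
        A_shifted >>= 1          # (4.2)
        i -= 1                   # (4.3)
    return A, B, q               # (5) d is simply discarded in this classical sketch

WIDTH = 5                        # illustrative register width
division_ok = all(
    poly_mul(long_division(A, B, WIDTH)[2], A) ^ long_division(A, B, WIDTH)[1] == B
    and deg(long_division(A, B, WIDTH)[1]) < deg(A)
    for A in range(1, 16) for B in range(16, 32)
)
```

For instance, `long_division(0b110, 0b10011, 5)` returns `(0b110, 0b1, 0b111)`, i.e. remainder 1 and quotient z^2 + z + 1.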

7 The Problem of Synchronization 

In the discrete logarithm algorithm, the extended Euclidean algorithm for 
polynomials will be applied to a superposition of inputs. For this reason 
we have to be careful that the steps of the algorithm are appropriately 
synchronized, so that each element in the superposition is undergoing the 
same step at any given time. In the naive implementation described above, 
we shift A left by m - deg(A) - 1 bits. The number of computational steps to
perform this shift depends on deg(A). When the computation is applied to
a superposition of inputs, deg(A) will be different for the different elements
in the superposition. Thus the number of computational steps is different for
different elements in superposition. This means the stages of the algorithm
will not be properly synchronized between elements in superposition. 

This synchronization problem can be solved by applying a general
technique of synchronizing the implementation [PZ03] (see footnote 1). We explain
synchronization by way of an example. Suppose a computation C consists of some
sequence of three simple reversible operations $o_1$, $o_2$ and $o_3$ (and no other
operations). The time taken to perform each of the operations $o_1, o_2, o_3$ is
independent of the input. This means that on a superposition of inputs, the
time required to perform the operation $o_1$ (for example) is the same for all
elements in the superposition.

Footnote 1: In [PZ03] they refer to the technique as "desynchronization", but we feel
"synchronizing" is more clear.

The quantum computation C is some sequence of the operations $o_1$, $o_2$
and $o_3$, in any order, and with repetitions. For example, C applied to the
input basis state $|x\rangle$ might consist of $o_1$ applied 4 times, followed by $o_2$
applied 1 time, followed by $o_3$ applied 2 times, followed by $o_1$ applied 1
time, followed by $o_2$ applied 3 times. That is,

$$C|x\rangle = o_2 o_2 o_2\, o_1\, o_3 o_3\, o_2\, o_1 o_1 o_1 o_1 |x\rangle.$$

The synchronization problem is that for another input basis state (in a
superposition of inputs), the sequence of operations might be different. For
example, on $|x'\rangle$ the same computation C might consist of $o_1$ applied 1 time,
followed by $o_2$ applied 4 times, followed by $o_3$ applied 1 time, followed by $o_1$
applied 3 times. That is,

$$C|x'\rangle = o_1 o_1 o_1\, o_3\, o_2 o_2 o_2 o_2\, o_1 |x'\rangle.$$

The idea of synchronization is to have all the computations in the
superposition cycle through the 3 operations repeatedly, each time allowing the
computation to either apply the operation once, or not apply it (wait for the
next operation). The cycle is repeated a sufficient number of times so that
sufficiently many of the computations in superposition have finished. For
the computation C above applied to the two input basis states $|x\rangle$ and $|x'\rangle$,
this is illustrated in Figure 6. In the figure, the operations applied at each
step are indicated by an x in the corresponding box.

Figure 6: Synchronization example. (Time runs along the horizontal axis through
repeated cycles of $o_1, o_2, o_3$.)

We now describe more explicitly how to implement synchronization. There must be a way for the
computation to tell when a series of $o_i$'s is finished and the next one should
begin. We want to do this reversibly, so there must be a way to tell both
when an $o_i$ is the first in a series, and when it is last in a series. In each $o_i$
we can include a sequence of gates which flips a flag qubit f if $o_i$ is the
first in a sequence, and another mechanism that flips f if $o_i$ is the last in a
sequence. We also make use of a small "counter" register c to control which
operation is scheduled to be applied at the current step. Thus we have a
triple x, f, c where x stands for the actual data. We initialize both f and c
to 1 to signify that the first operation will be the first in a sequence of $o_1$
operations. The physical quantum-gate sequence which we apply is

$$\ldots\ \mathrm{ac}\ o'_1\ \mathrm{ac}\ o'_3\ \mathrm{ac}\ o'_2\ \mathrm{ac}\ o'_1\ \mathrm{ac}\ o'_3\ \mathrm{ac}\ o'_2\ \mathrm{ac}\ o'_1 |x\rangle$$

where the $o'_i$ are the $o_i$ conditioned on i = c and ac stands for "advance
counter". These operations act as follows on the triple:

$$o'_i:\ \text{if } i = c:\quad x, f, c \leftrightarrow o_i(x),\ f \oplus \mathrm{first} \oplus \mathrm{last},\ c$$

$$\mathrm{ac}:\quad x, f, c \leftrightarrow x,\ f,\ (c + f) \bmod 3$$

where $o'_i$ does nothing if $i \neq c$, the symbol "$\oplus$" means XOR, and $(c + f) \bmod 3$
is taken from {1, 2, 3}. In the middle of a sequence of $o_i$'s the flag f is 0,
and so the counter doesn't advance. The last in a sequence of $o_i$'s will set
f = 1 and the counter will advance in the next ac step. The first operation
of the next series resets f to 0, so that this series can progress.
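
The flag-and-counter mechanism can be simulated classically for a single basis state. The sketch below is an illustration under assumptions: each computation is described by a hypothetical "schedule" of runs (operation index, number of repetitions) that follows the cyclic order 1, 2, 3, the tests for "first" and "last" are read off that schedule rather than computed from the data, and a computation that has finished simply lets the remaining gates pass without acting (the halting counter is not modelled).

```python
def run_synchronized(schedule, cycles):
    """schedule: list of (op, repeat) runs; returns the ops applied, in order."""
    f, c = 1, 1                  # flag and counter, both initialised to 1
    run, pos = 0, 0              # position of this computation in its schedule
    applied = []
    for _ in range(cycles):
        for i in (1, 2, 3):
            # o'_i acts only if the counter currently selects operation i
            if i == c and run < len(schedule):
                op, repeat = schedule[run]
                assert op == i, "schedule must follow the cyclic order 1, 2, 3"
                first = int(pos == 0)
                last = int(pos == repeat - 1)
                applied.append(op)
                f ^= first ^ last
                pos += 1
                if last:
                    run, pos = run + 1, 0
            # ac: advance the counter by f, keeping it in {1, 2, 3}
            c = (c + f - 1) % 3 + 1
    return applied

# The two example computations from the text, on |x> and |x'>.
schedule_x = [(1, 4), (2, 1), (3, 2), (1, 1), (2, 3)]
schedule_x_prime = [(1, 1), (2, 4), (3, 1), (1, 3)]
ops_x = run_synchronized(schedule_x, 8)
ops_x_prime = run_synchronized(schedule_x_prime, 8)
```

Both computations see the same physical gate sequence (eight cycles of the three conditioned operations, each followed by ac), yet each applies its own sequence of operations in the right order.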

Of course, even though the individual steps in the algorithm are syn- 
chronized, the computations in the superposition will in general finish the 
extended Euclidean algorithm after different numbers of iterations. For 
those that finish earlier than others, we cannot simply have them "halt" 
and wait for the others to finish (this would result in an implementation 
that is not reversible). To ensure reversibility, those elements in superposi- 
tion that halt early must increment a small counter at each time step until 
the other elements in superposition finish. We will call this small counter 
the "halting counter". 

We do not describe in detail how to apply synchronization to repair the 
naive implementation, but instead proceed with a better optimized imple- 
mentation that will make use of synchronization. 
8 An optimized implementation



8.1 The implementation 

The starting point for an optimized implementation is the observation that 
large quotients occur relatively rarely in the extended Euclidean algorithm 
for polynomials. In the naive implementation by shifting A all the way to 
the left in the long divisions, we were doing more work than necessary. Our 
optimized implementation will make use of "adaptive" long divisions, whose 
behaviour is conditioned on the sizes of the arguments. In fact, any O(n^2)
algorithm (classical or quantum) must do this kind of adaptive division. 
For a quantum implementation, we will then note that since large quotients 
occur rarely, we can bound the size of the quotient with a negligible loss in 
fidelity. 

The other main observation underlying the optimized implementation 
is that in the naive implementation we were using much more space than 
necessary to store the Euclidean pairs. In the naive implementation we used 
a separate m-qubit register for each of A,B,a,b. It turns out that this is 
twice as much space as is necessary. 

Claim 1 At every stage of the extended Euclidean algorithm for polynomials
we have deg(aB) = m.

Proof: Initially we have aB = f and so deg(aB) = m, so the
claim is true at the first iteration. Each iteration transforms

$$a \to a' = b + qa, \qquad B \to B' = A.$$

So we have

$$\begin{aligned}
\deg(a'B') &= \deg((b + qa)A) \\
&= \deg(qaA) \qquad (\text{since } \deg(qa) > \deg(a) > \deg(b)) \\
&= \deg(q) + \deg(a) + \deg(A) \\
&= \deg(B) - \deg(A) + \deg(a) + \deg(A) \\
&= \deg(aB) \\
&= m
\end{aligned}$$

and so the claim is true after each iteration. □
An immediate corollary of this claim is

Corollary 1 At every stage of the extended Euclidean algorithm for
polynomials we have

$$\deg(a) + \deg(A) < m \quad \text{and} \quad \deg(b) + \deg(B) < m.$$

Proof: Since deg(A) < deg(B) we have

$$\deg(a) + \deg(A) = \deg(aA) < \deg(aB) = m.$$

Similarly, since deg(a) > deg(b) we have

$$\deg(b) + \deg(B) = \deg(bB) < \deg(aB) = m. \qquad \Box$$
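
Claim 1 and Corollary 1 are easy to confirm numerically on a small example. The sketch below is an illustration under assumptions: it uses the modulus f(z) = z^4 + z + 1, hypothetical helper names, and the convention deg(0) = -1 so that the inequalities remain meaningful while b = 0.

```python
def deg(p):
    """Degree of a binary polynomial stored as an int (deg(0) = -1)."""
    return p.bit_length() - 1

def poly_divmod(n, d):
    """Long division over GF(2): (q, r) with n = q*d + r, d nonzero."""
    q = 0
    while deg(n) >= deg(d):
        shift = deg(n) - deg(d)
        q ^= 1 << shift
        n ^= d << shift
    return q, n

def poly_mul(u, v):
    """Carry-less product of two binary polynomials."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        u <<= 1
        v >>= 1
    return r

def check_degree_invariants(C, f):
    """Check deg(a) + deg(B) = m and the two strict bounds at every stage."""
    m = deg(f)
    (a, A), (b, B) = (1, C), (0, f)
    while A != 0:
        if deg(a) + deg(B) != m:                       # Claim 1
            return False
        if not (deg(a) + deg(A) < m and deg(b) + deg(B) < m):   # Corollary 1
            return False
        q, r = poly_divmod(B, A)
        (a, A), (b, B) = (b ^ poly_mul(q, a), r), (a, A)
    return deg(a) + deg(B) == m                        # Claim 1 after the last step

F = 0b10011                      # z^4 + z + 1, an assumed example modulus
invariants_hold = all(check_degree_invariants(C, F) for C in range(1, 16))
```

These bounds are what allow a and A (and likewise b and B) to fit together in a single m-qubit register.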

By the corollary, we see that a single m-qubit register will be sufficient to 
store both a and A, and a second m-qubit register is sufficient to store both 
b and B. Thus A and a can share a single m-qubit register, and b and B 
can share a second m-qubit register. This reduces the total space to store 
A, B, a, b from 4m to 2m. The problem with this approach is that the relative 
sizes of a and A change from one iteration to the next, and thus so does the 
boundary between A and a within the single m-qubit register (similarly for 
b and B). Further, at any iteration, this boundary may be different between 
elements in superposition. So we need a way to quantumly calculate the 
position of this boundary for each iteration. 

First, observe that the boundary between A and a can be at the same 
position as the boundary between B and b, in any iteration (since deg(A) < 
deg(B)). Second, notice that the boundary can be easily determined if we 
know the degrees of A, B, a, b. It will turn out to be convenient to store A 
and a in a single register in opposing directions. That is, the most significant 
bit of A is at one end of the register, and the most significant bit of a is at 
the extreme other end of the register. Between A and a the register will be 
padded with zeros. Similarly for B and b. The situation for register sharing 
is illustrated in Figure 7. 

From Figure 7 it can be seen that the boundary for register-sharing 
can be determined from deg(a) or from deg(B). Our strategy will be to 
store the degree of each of A, B, a, b at each step, and use either deg(a) or 
deg(B) (depending on what operation we are performing) to determine the 
boundary. For convenience, we will keep track of the degrees of all of A, B, a 
and b, requiring 4 separate [log m]-qubit registers. 
Figure 7: The positions of A, B, a, b for register sharing. (Labels in the 
figure: lowest order bit; highest order bit, always 1.) 



As before, we focus on implementing the long division 

A,B,0 <-> A,B + qA,q. 

The long division algorithm is modified slightly as a result of the new 
strategy for storing A and B. Note that we do not need to initially shift A 
all the way towards the high order end, since the most significant bits of A 
and B are already in the same position. Instead of shifting A one bit at a 
time towards the low order end at each step, we shift B one bit at a time 
towards the high order end. At each stage, a new bit of q is first read out 
from the high order bit of B. Then, controlled on the new bit of q 
(equivalently the high order bit of B), B is XORed with A (this is the 
conditional subtraction). Then B is shifted towards the high order end by 
1 bit, and the value of deg(B) is decremented by 1. Note that no significant 
bits of B are lost in the shift, because after the conditional XOR operation, 
we know the high order bit of B will be 0. After the long division is 
complete, the remaining operation is to shift off any leading (high order) 
zeros in the final value of B, and decrement the value of deg(B) accordingly. 
This is done so that the most significant bits of A and B are in 
corresponding positions for the next iteration. The operations o1 and o2 for 
implementing the long division in a synchronized manner are as follows: 

o1: (a) The high-order bit of B becomes the next bit of q (starting at 
the high-order bit of q and working down). 

(b) Conditioned on the new bit of q, B is replaced with B ⊕ A. 

(c) B is shifted towards the high order end by 1 bit, and deg(B) is 
decremented by 1. 

o2: B is shifted towards the high order end by 1 bit, and deg(B) is 
decremented by 1. 

The first in a sequence of o1 operations is recognized by the condition q = 0. 
The last in a sequence of o1 operations is recognized by deg(A) = deg(B). 
When performing the last in a sequence of o1 operations, only part (a) is 
performed (so parts (b) and (c) can be conditioned on the flag qubit). The 
first in a sequence of o2 operations is recognized by deg(A) = deg(B). The 
last in a sequence of o2 operations is recognized when the bit in the 
high-order "slot" of the register containing B is |1⟩. 

The long division algorithm is illustrated by an example. Suppose we 
have the following: 

A = z^2 + 1 (A = 101) 

B = z^4 + z^2 + 1 (B = 10101). 

The long division B / A as would be performed by hand is shown in Figure 8. 
The long division as performed by the algorithm is shown in Figure 9. 

        100 
101 ) 10101 
      10100 
      00001 
      01010 
      00001 
      00101 
      00001 

Figure 8: Example long division by hand. 

One feature of the algorithm suggested by the example is that the qubits can 
be spatially arranged so that operations are performed on neighbouring 
qubits. (Note that in the implementation of shifts, illustrated in an earlier 
figure, the CNOT gates are between adjacent qubits as well.) This might be 
advantageous for a given physical implementation. In Figure 9, note that 
blank cells contain the value 0, but are shown as blank to make it easier to 
understand the steps of the long division. 
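
The example division can be reproduced with a classical bit-level sketch of the "shift B towards the high order end" strategy. This is only an illustration under stated assumptions: the function name shifted_long_division is hypothetical, registers are modelled as Python integers of fixed width, and the synchronization bookkeeping (flag qubit, detection of first and last operations) is not modelled. The sketch applies the conditional XOR for every quotient bit and omits only the shift after the final bit, then strips leading zeros as in o2.

```python
def shifted_long_division(A: int, B: int) -> tuple[int, int]:
    """Return (q, r) with B = q*A + r over GF(2); requires A != 0, deg(B) >= deg(A)."""
    deg_a = A.bit_length() - 1
    deg_b = B.bit_length() - 1
    assert A != 0 and deg_b >= deg_a

    width = deg_b + 1                  # register width in this sketch
    top = 1 << (width - 1)             # the high-order "slot"
    mask = (1 << width) - 1
    reg_a = A << (deg_b - deg_a)       # most significant bits of A and B aligned
    reg_b = B
    q = 0

    n_bits = deg_b - deg_a + 1         # number of quotient bits
    for i in range(n_bits):
        bit = 1 if reg_b & top else 0  # (a) high-order bit of B is the next bit of q
        q = (q << 1) | bit
        if bit:
            reg_b ^= reg_a             # (b) conditional subtraction B <- B xor A
        if i < n_bits - 1:
            reg_b = (reg_b << 1) & mask  # (c) shift towards the high order end
            deg_b -= 1

    # o2-style clean-up: shift off leading zeros so the top slot holds a 1.
    while reg_b and not (reg_b & top):
        reg_b = (reg_b << 1) & mask
        deg_b -= 1

    remainder = reg_b >> (width - 1 - deg_b) if reg_b else 0
    return q, remainder


# The example from the text: A = 101, B = 10101.
q, r = shifted_long_division(0b101, 0b10101)
```

For the example this gives q = 100 and remainder 1, matching the hand division in Figure 8. Because the top slot of B is always 0 after the conditional XOR, the shift in step (c) never discards a significant bit.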

We have omitted the details of how to condition the steps of the long 
division on the value which determines the boundary for register sharing. 
For example, in the implementation of A, B, 0 <-> A, B + qA, q, the 
operations on A, B, q will be conditioned on the value in the register 
containing deg(a) (from which the boundary position for register sharing can 
be determined). These details are very complicated, but the techniques for 
implementing controlled gates in [BBC+95] indicate that it can be done with 
no ancillary qubits, and a polynomial increase in time. 

Figure 9: Example of optimized implementation of long division. 

8.2 Qubit space complexity 



We saw earlier that the number of qubits required to implement the 
elliptic curve group operation is bounded by the number of qubits required 
to implement the extended Euclidean algorithm for polynomials. Here we 
count the number of qubits required by our implementation. 

By using register sharing, the values of A, B, a, b can be stored using 
2m qubits. The values of deg(A), deg(B), deg(a), deg(b) must be initially 
computed and stored, requiring 4[log m] + 4 qubits (as seen earlier). 
We also need to store the value of the quotient q. We noted that in the 
extended Euclidean algorithm for polynomials large quotients are rare. In 
[PZ03] it is shown that by bounding the size of q to 3[log m] bits, the total 
loss of fidelity remains small enough to be acceptable in the context of 
Shor's algorithm. So we store q in a register of 3[log m] qubits. 

For the synchronization we need a flag qubit f, and a 2-qubit counter 
register c (to index the 4 operations o1(a), o1(b), o1(c), and o2 used in the 
synchronization). Recall that we also need a "halting counter", as the 
computations in the superposition will finish the extended Euclidean 
algorithm for polynomials after different numbers of iterations. The exact 
size of this halting counter depends on the exact time complexity of the 
algorithm. However, as our implementation is clearly polynomial in m, we know 
that the size of the halting counter will be at most logarithmic in m. We 
will write H for the number of qubits required for the halting counter, where 
it is understood that H is O(log m). Such a halting counter would be required 
in any quantum implementation of the extended Euclidean algorithm for 
polynomials. 
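
The register sizes listed in this section can be tallied mechanically. The sketch below is bookkeeping only: the function name qubit_tally is hypothetical, and the rounded logarithm written [log m] in the text is passed in as the parameter log_m rather than computed, since its rounding convention is not fixed here. The halting-counter size H is likewise supplied by the caller.

```python
def qubit_tally(m: int, log_m: int, halting: int) -> dict[str, int]:
    """Sum the register sizes named in the text for field GF(2^m)."""
    parts = {
        "A, B, a, b (register sharing)": 2 * m,
        "degree registers": 4 * log_m + 4,
        "quotient q": 3 * log_m,
        "flag qubit": 1,
        "operation counter c": 2,
        "halting counter": halting,
    }
    parts["total"] = sum(parts.values())
    return parts


# Illustrative values only: m = 163, log_m = 8 (assuming a base-2 logarithm
# rounded up), and the halting counter left out (H = 0) to show the other terms.
example = qubit_tally(163, 8, 0)
```

Under these assumptions the terms other than H add up to 2m + 7[log m] + 7, so the dominant cost is the 2m qubits of the two shared registers.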

So we have that the qubit space complexity for our implementation of 
the extended Euclidean algorithm for polynomials, and thus of the elliptic 
curve group operation for curves over GF(2^m), is 